Learning to embed semantic similarity for joint image-text retrieval
We present a deep learning approach for learning the joint semantic
embeddings of images and captions in a Euclidean space, such that the semantic
similarity is approximated by the L2 distances in the embedding space. For
that, we introduce a metric learning scheme that utilizes multitask learning to
learn the embedding of identical semantic concepts using a center loss. By
introducing a differentiable quantization scheme into the end-to-end trainable
network, we derive a semantic embedding of semantically similar concepts in
Euclidean space. We also propose a novel metric learning formulation using an
adaptive margin hinge loss that is refined during the training phase. The
proposed scheme was applied to the MS-COCO, Flickr30K and Flickr8K datasets,
and was shown to compare favorably with contemporary state-of-the-art
approaches.
Comment: in IEEE Transactions on Pattern Analysis and Machine Intelligence,
202
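The abstract above describes a hinge loss over L2 distances in the joint embedding space. A minimal sketch of a triplet-style hinge loss is given below; it is an illustration under stated assumptions, not the authors' formulation. The function names, the triplet structure (one matching and one mismatched caption per image), and the fixed `margin` argument are hypothetical. The abstract says the margin is adapted during training but does not say how, so no update rule is shown.

```python
import math


def l2_distance(u, v):
    """Euclidean (L2) distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_hinge_loss(image_emb, matching_caption_emb, mismatched_caption_emb, margin):
    """Hinge loss that is zero once the mismatched caption is at least
    `margin` farther from the image than the matching caption.

    Assumption: `margin` is supplied by the caller; how it would be
    refined during training is not specified in the abstract.
    """
    d_pos = l2_distance(image_emb, matching_caption_emb)
    d_neg = l2_distance(image_emb, mismatched_caption_emb)
    return max(0.0, margin + d_pos - d_neg)
```

With an image embedding at the origin, a matching caption at distance 5 and a mismatched caption at distance 10, the loss is zero for any margin up to 5 and grows linearly beyond that.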
Camera Pose Auto-Encoders for Improving Pose Regression
Absolute pose regressor (APR) networks are trained to estimate the pose of
the camera given a captured image. They compute latent image representations
from which the camera position and orientation are regressed. APRs provide a
different tradeoff between localization accuracy, runtime, and memory, compared
to structure-based localization schemes that provide state-of-the-art accuracy.
In this work, we introduce Camera Pose Auto-Encoders (PAEs), multilayer
perceptrons that are trained via a Teacher-Student approach to encode camera
poses using APRs as their teachers. We show that the resulting latent pose
representations can closely reproduce APR performance and demonstrate their
effectiveness for related tasks. Specifically, we propose a light-weight
test-time optimization in which the closest train poses are encoded and used to
refine camera position estimation. This procedure achieves a new
state-of-the-art position accuracy for APRs, on both the CambridgeLandmarks and
7Scenes benchmarks. We also show that train images can be reconstructed from
the learned pose encoding, paving the way for integrating visual information
from the train set at a low memory cost. Our code and pre-trained models are
available at https://github.com/yolish/camera-pose-auto-encoders.
Comment: Accepted to ECCV2
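The Teacher-Student idea described above, in which a multilayer perceptron encodes a camera pose so that its output matches the latent representation computed by an APR, can be sketched roughly as follows. This is a toy illustration, not the code from the linked repository: the two-layer shape, the ReLU activation, the mean squared error objective, and all names are assumptions.

```python
def mlp_encode(pose, w1, b1, w2, b2):
    """Two-layer perceptron: pose vector -> hidden (ReLU) -> latent vector.

    Assumption: the pose is a flat list of numbers (for example position
    followed by orientation); the real input layout is not given here.
    """
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, pose)) + b)
        for row, b in zip(w1, b1)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]


def distillation_loss(student_latent, teacher_latent):
    """Mean squared error between the student's pose encoding and the
    teacher APR's latent representation (an assumed choice of objective)."""
    n = len(teacher_latent)
    return sum((s - t) ** 2 for s, t in zip(student_latent, teacher_latent)) / n
```

In training, the weights would be updated to reduce `distillation_loss` over the training poses, with the teacher latents held fixed.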
Paying Attention to Multiscale Feature Maps in Multimodal Image Matching
We propose an attention-based approach for multimodal image patch matching
using a Transformer encoder attending to the feature maps of a multiscale
Siamese CNN. Our encoder is shown to efficiently aggregate multiscale image
embeddings while emphasizing task-specific appearance-invariant image cues. We
also introduce an attention-residual architecture, using a residual connection
bypassing the encoder. This additional learning signal facilitates end-to-end
training from scratch. Our approach is experimentally shown to achieve new
state-of-the-art accuracy on both multimodal and single modality benchmarks,
illustrating its general applicability. To the best of our knowledge, this is
the first successful application of the Transformer encoder architecture to
the multimodal image patch matching task.
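The attention-residual idea described above, where a residual connection bypasses the encoder, can be illustrated with a stripped-down example. The sketch below uses a single self-attention step with no learned projections in place of a full Transformer encoder, and it combines the two paths by addition. Both choices are simplifying assumptions, as are the function names; the abstract does not specify how the bypass is merged.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections
    (queries, keys and values are the tokens themselves)."""
    dim = len(tokens[0])
    attended = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[i] for w, value in zip(weights, tokens)) for i in range(dim)]
        )
    return attended


def attention_residual(tokens):
    """Add the encoder input back to the attended output, so the input
    features also reach the next stage directly (assumed additive merge)."""
    attended = self_attention(tokens)
    return [[t + a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)]
```

Here each token would stand for a feature vector taken from one location of a multiscale feature map. Because the input reaches the output directly, gradients can flow to the CNN even when the attention weights are uninformative early in training.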
Hierarchical Attention-based Age Estimation and Bias Estimation
In this work we propose a novel deep-learning approach for age estimation
based on face images. We first introduce a dual image augmentation-aggregation
approach based on attention. This allows the network to jointly utilize
multiple face image augmentations whose embeddings are aggregated by a
Transformer-Encoder. The resulting aggregated embedding is shown to better
encode the face image attributes. We then propose a probabilistic hierarchical
regression framework that combines a discrete probabilistic estimate of age
labels with a corresponding ensemble of regressors. Each regressor is
specifically adapted and trained to refine the probabilistic estimate over a
range of ages. Our scheme is shown to outperform contemporary schemes and
achieve new state-of-the-art age estimation accuracy on the MORPH II dataset.
Finally, we introduce a bias analysis of state-of-the-art age estimation
results.
Comment: 11 pages, 7 figures
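One plausible way to combine a discrete probabilistic estimate with an ensemble of per-range regressors, as described above, is a probability-weighted sum of the regressors' outputs. The sketch below shows that combination only. The weighting rule, the function names, and the one-regressor-per-age-range layout are assumptions and may differ from the paper's actual framework.

```python
import math


def softmax(logits):
    """Numerically stable softmax turning logits into probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hierarchical_age_estimate(range_logits, regressor_outputs):
    """Weight each regressor's refined age by the probability that the
    face falls in that regressor's age range, then sum.

    Assumption: regressor_outputs[i] is the age predicted by the
    regressor responsible for age range i.
    """
    probs = softmax(range_logits)
    return sum(p * age for p, age in zip(probs, regressor_outputs))
```

For example, if two age ranges are judged equally likely and their regressors predict 20 and 30, the combined estimate is 25. As one range becomes dominant, the estimate approaches that range's regressor output.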